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Abstract 

Commercial graphics processors (CPUs) have high com- 
pute capacity at very low cost, which makes them at- 
tractive for general purpose scientific computing. In this 
paper we show how graphics processors can be used for 
A'^-body simulations to obtain improvements in perfor- 
mance over current generation CPUs. We have devel- 
oped a highly optimized algorithm for performing the 
0{N'^) force calculations that constitute the major part 
of stellar and molecular dynamics simulations. In some 
of the calculations, wc achieve sustained performance of 
nearly 100 GFlops on an ATI X1900XTX. The perfor- 
mance on CPUs is comparable to specialized processors 
such as GRAPE-6A and MDGRAPE-3, but at a frac- 
tion of the cost. Furthermore, the wide availability of 
CPUs has significant implications for cluster computing 
and distributed computing efforts like Folding@IIome. 

Keywords: Celestial Mechanics, N-Body Simula- 
tions, Stellar Dynamics, Molecular Dynamics, Molecu- 
lar Simulation, Data Parallel Computing, Stream Com- 
puting, Programmable Graphics Hardware, GPU Com- 
puting, Brook 



1 Introduction 

The classical iV-body problem consists of obtaining the 
time evolution of a system of N mass particles interact- 
ing according to a given force law. The problem arises 
in several contexts, ranging from molecular scale calcu- 
lations in structural biology to stellar scale research in 
astrophysics. 

Molecular dynamics (MD) has been successfully 
used to understand how certain proteins fold and 
function, which have been outstanding questions 
in biology for over three decades [Snow et al. 20051 
IGomez et al. 2004] . Exciting new developments in 
MD methods offer hope that such calculations 



will play a significant role in future drug re- 
search |Fujitani et al. 2005| . In stellar dynamics where 
experimental observations are hard, if not impossible, 
theoretical calculations may often be the only way to 
understand the formation and evolution of galaxies. 

Analytic solutions to the equations of motion for more 
than 2 particles or complicated force functions are in- 
tractable which forces one to resort to computer simula- 
tions. A typical simulation consists of a force evaluation 
step, where the force law and the current configuration 
of the system are used to the compute the forces on 
each particle, and an update step, where the dynam- 
ical equations (usually Newton's laws) are numerically 
stepped forward in time using the computed forces. The 
updated configuration is then reused to calculate forces 
for the next time step and the cycle is repeated as many 
times as desired. 

The simplest force models are pairwise additive, that 
is the force of interaction between two particles is in- 
dependent of all the other particles, and the individual 
forces on a particle add linearly. The force calculation 
for such models is of complexity 0{N^). Since typical 
studies involve a large number of particles (10'^ to 10^) 
and the desired number of integration steps is usually 
very large (10® to 10^^), the computational requirements 
often limit both the problem size as well as the sim- 
ulation time and consequently, the useful information 
that may be obtained from such simulations. Numerous 
methods have been developed to deal with these issues. 
For molecular simulations, it is common to reduce the 
number of particles by treating the solvent molecules 
as a continuum. In stellar simulations, one uses indi- 
vidual time stepping or tree algorithms to minimize the 
number of force calculations. Despite such algorithmic 
approximations and optimizations, the computational 
capabilities of current hardware remain a limiting fac- 
tor. 

Typically iV-body simulations utilize neighborlists, 
tree methods or other algorithms to reduce the order of 
the force calculations. Previous work lElsen et al."2005] 
demonstrated a GPU implementation of a neighbor list 
based method to compute non-bonded forces. However, 
since the GPU so far outperformed the CPU, the neigh- 
bor list creation quickly became a limiting factor. Build- 
ing the ncighborlist on the GPU is extremely difficult 
due to the lack of specific abilities (namely indirected 



output) and research on computing the neighborUst on 
the GPU is still in progress. Other simplistic simula- 
tions that do not need neighborlist updates have been 
implemented by others [Juekuan Yang 2006] . However, 
for small N, we find we can do an O(iV^) calculation 
significantly faster on the GPU than an 0{N) method 
using the CPU (or even with a combination of the GPU 
and CPU). This has direct applicability to biological 
simulations that use continuum models for the solvent. 
We note also that in many of the reduced order meth- 
ods such as tree based schemes, at some stage an 0{N'^) 
calculation is performed on a subsystem of the particles, 
so our method can be used to improve the performance 
of such methods as well. When using GRAPE acceler- 
ator cards for tree based algorithms, the host processor 
takes care of building the tree and the accelerator cards 
are used to speed up the force calculation step; CPUs 
could be used in a similar way in place of the GRAPE 
accelerator boards. 

Using the methods described below, we are able to 
accelerate the force calculation on CPUs over 25 times 
compared to highly optimized SSE code running on an 
Intel Pentium 4. This performance is in the range of the 
specially designed GRAPE-6A [Fukushige et al. 2005] 
and MDGRAPE-3 |Taiji et al. 2003] processors, but 



uses a commodity processor at a much better perfor- 
mance/cost ratio. 

2 Algorithm 

General purpose CPUs are designed for a wide vari- 
ety of applications and take limited advantage of the 
inherent parallelism in many calculations. Improving 
performance in the past has relied on increasing clock 
speeds and the size of high speed cache memories. Pro- 
gramming a CPU for high performance scientific appli- 
cations involves careful data layout to utilize the cache 
optimally and careful scheduling of instructions. 

In contrast, graphics processors are designed for in- 
trinsically parallel operations, such as shading pixels, 
where the computations on one pixel are completely in- 
dependent of another. GPUs are an example of stream- 
ing processors, which use explicit data parallelism to 
provide high compute performance and hide memory 
latency. Data is expressed as streams and data par- 
allel operations are expressed as kernels. Kernels can 
be thought of as functions that transform each element 
of an input stream into a corresponding element of an 
output stream. When expressed this way, the kernel 
function can be applied to multiple elements of the in- 
put stream in parallel. Instead of blocking data to fit 
caches, the data is streamed into the compute units. 
Since streaming fetches are predetermined, data can be 
fetched in parallel with computation. We describe be- 
low how the iV-body force calculation can be mapped 



to streaming architectures. 

In its simplest form the iV-body force calculation can 
be described by the following pseudo-code: 

for i = 1 to N 
force [i] = 
ri = coordinates [i] 
for j = 1 to N 

rj = coordinates [j] 

force [i] = force [i] + f orce_f unctionC ri , rj ) 

end 

end 

Since all coordinates are fixed during the force calcula- 
tion, the force computation can be parallelized for the 
different values of i. In terms of streams and kernels, 
this can be expressed as follows: 

stream coordinates; 
stream forces; 

kernel kforceC ri ) 
force = 
for j = 1 to N 

rj = coordinates [j] 

force = force + force_f unctionC ri, rj ) 

end 

return force 
end kernel 

forces = kforceC coordinates ) 

The kernel kf orce is applied to each element of the 
stream coordinates to produce an element of the 
forces stream. Note that the kernel can perform an 
indexed fetch from the coordinates stream inside the 
j-loop. An out-of-order indexed fetch can be slow, since 
in general, there is no way to prefetch the data. However 
in this case the indexed accesses are sequential. More- 
over, the j-loop is executed simultaneously for many i- 
elements; even with minimal caching, r j can be reused 
for many N i-elements without fetching from memory 
thus the performance of this algorithm would be ex- 
pected to be high. The implementation of this algorithm 
on GPUs and CPU-specific performance optimizations 
are described in the following section. 

There is however one caveat in using a streaming 
model. Newton's Third law states that the force on 
particle i due to particle j is the negative of the force 
on particle j due to particle i. CPU implementations 
use this fact to halve the number of force calculations. 
However, in the streaming model, the kernel has no 
ability to write an out-of-sequence element (scatter) , so 
forces [j] can not be updated while summing over the 
j-loop to calculate forces [i]. This effectively doubles 
the number of computations that must be done on the 
GPU compared to a CPU. 

Several commonly used force functions were imple- 
mented to measure and compare performance. For stel- 
lar dynamics, depending on the integration scheme be- 
ing used, one may need to compute just the forces, or 
the forces as well as the time derivative of the forces 
(jerk). We have designated the corresponding kernels 
GA (Gravitational Acceleration) and GAJ (Gravita- 
tional Acceleration and Jerk). In molecular dynamics. 



it is not practical to use 0{N'^) approaches when the 
solvent is treated explicitly, so we restrict ourselves to 
continuum solvent models. In such models, the quan- 
tum interaction of non-bonded atoms is given by a 
Lennard- Jones function and the electrostatic interac- 
tion is given by Coulomb's Law suitably modified to 
account for the solvent. The LJC (constant) kernel cal- 
culates the Coulomb force with a constant dielectric, 
while the LJC(linear) and LJC(sigmoidal) kernels use 
distance dependent dielectrics. The equations used for 
each kernel as well as the arithmetic complexity of the 
calculation are shown in Table [T] 

3 Implementation and Optimiza- 
tion on CPUs 

3.1 Brook 

BrookGPU |Buck et al. 2004b| is a C-like high-level lan- 
guage that can be used to program CPUs as stream- 
ing processors. Streams are stored as textures and 
kernels are implemented as fragment programs. The 
BrookGPU run-time library can utilize a number of 
graphics interfaces; for this work we used the Microsoft 
DirectX 9.0c API and the Pixel Shader 3.0 specifica- 
tion [Microsoft 2006] . DirectX [Microsoft 2003] pro- 
vides a vendor-independent abstraction of hardware 
features. In the Pixel Shader 3.0 specification, the 
shader has access to 32 general purpose, 4-component, 
single precision floating point (float4) registers, 16 
f loat4 input textures, 4 f loat4 render targets (output 
streams) and 32 f loat4 constant registers. A shader 
consists of a number of assembly-like instructions. Cur- 
rent CPUs have a maximum static program length of 
512 (ATI) or 1024 (NVIDIA) instructions. There is a 
loop limit of 255 iterations of a loop body, but loops can 
be nested to increase the total numbers of iterations. 
NVIDIA is limited to 65,535 dynamic instructions and 
ATI can support an unlimited number. The BrookGPU 
compiler translates kernels into a high level shader lan- 
guage like CG or HLSL, which is then compiled into 
pixel shader assembly by an appropriate shader com- 
piler like Microsoft's fxc or NVIDIA's cgc. The graph- 
ics driver finally maps the Pixel Shader assembly code 
into hardware instructions as appropriate to the archi- 
tecture. 

3.2 Precision 

Recent graphics boards have 32-bit floating point arith- 
metic. Consequently we have done all the calculations 
in single precision. Whether or not this is sufficiently ac- 
curate for the answers being sought from the simulation 
is often a subject of much debate and the authors do not 
intend to settle it here. We are of the opinion that in 



many cases, though certainly not all, single precision is 
enough to obtain useful results. Furthermore, if double 
precision is necessary, it is usually not required through- 
out the calculation, but rather only in a select few in- 
stances. For reference, GRAPE-6 [Makino et al. 200"3a] 
performs the accumulation of accelerations, subtraction 
of position vectors and update of positions in 64-bit fixed 
point arithmetic with everything else in either 36, 32 or 
29 bit floating point precision. It is quite common to do 
the entire force calculation in single precision for molec- 
ular simulations while using double precision for some 
operations in the update step. If and where necessary, 
the appropriate precision could be emulated on graph- 
ics boards jGoddeke et al. 2005] . The impact on perfor- 
mance would depend on where and how often it would 
be necessary to do calculations in double precision. 

3.3 General Optimization 

The algorithm was implemented for several force mod- 
els. For simplicity, in the following discussion, we 
only talk about the GA kernel, which corresponds to 
the gravitational attraction between two mass particles, 
given by 

^'ij ' 

where ai is the acceleration on particle i, G is a constant 
(often normalized to one), nij is the mass of particle j, 
and Tij is the vector displacement between particles i 
and j. The performance of the kernel for various input 
sizes are shown in Figure [TJ 

The algorithm outlined in Section [5] was imple- 
mented in BrookGPU and targeted for the ATI 
X1900XTX. Even this naive implementation per- 
forms very well, achieving over 40 GFlops, but its 
performance can be improved. This kernel ex- 
ecutes 48 Giga- instructions/sec and has a mem- 
ory bandwidth of 33 GB/sec. Using information 
from GPUBench [Buck et al. 2004a] . we expect the 
X1900XTX to be able to execute approximately 30-50 
Giga-instruction/sec (it depends heavily on the pipelin- 
ing of commands) and have a cache memory bandwidth 
of 41GB/sec. The nature of the algorithm is such that 
almost all the memory reads will be from the cache since 
all the pixels being rendered at a given time will be ac- 
cessing the same j-particle. Thus this kernel is limited 
by the rate at which the GPU can issue instructions 
(compute bound). 

To achieve higher performance, we used the standard 
technique of loop unrolling. This naive implementation 
is designated as a 1 x 1 kernel because it is not unrolled 
in cither i or j. The convention followed hereafter when 
designating the amount of unrolling will be that AxB 
means i unrolled A times and j unrolled B times. The 
second GA kernel (1x4) which was written unrolled the 



Kernel 



Formula 



Flops Input Inner BW Useful Giga System 

per Unroll (bytes) Loop (GB/s) GFLOPS Intrxns Size 
Intrxn. Insns. per sec. 



Gravity 
(accel) 



(r2.+e2)3/2 ''J 



19 



4x4 64 125 19.9 



94.3 



4.97 65,536 



Gravity 
(accel & jerk) 



(^2 +,2)3/2 "J 
ij 

^ij o ^^ij '^ij ^^ij 



(r2.+e2)3/2 (^2^,^2)5/2 



42 1x4 128 104 40.6 



53.5 



1.27 65,536 



LJC 

(constant) 



30 2x4 104 109 33.6 



77.6 



2.59 4096 



LJC 
(linear) 



30 



2x4 104 107 34.5 



79.5 



2.65 4096 



LJC 

(sigmoidal) 



C(r) : 



43 2x4 104 138 27.3 



90.3 



2.10 4096 



Tabic 1: Values for the maximum performance of each kernel on the X1900XTX. The instructions are counted as the 
number of pixel shader assembly arithmetic instructions in the inner loop. Intrxn = Interaction; Insns — Instructions; 
BW = Bandwidth. 
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Figure 1: GA Kernel with varying amounts of unrolling 

j-loop four times, enabling the use of the 4-way SIMD 
instructions on the GPU. This reduces instructions that 
must be issued by around a factor of 3. (We cannot 
reduce instructions by a factor of 4 because some Pixel 
Shader instructions are scalar). The performance for 
this kernel is shown in Figure [T] It achieves a modest 
speedup compared to the previous one, and we have 
now switched from being compute bound to bandwidth 
bound (35 Giga-Instructions/scc and «40GB/sec). 

Further reducing bandwidth usage is somewhat more 
difficult. It involves using the multiple render targets 
(MKT) capability of recent GPUs which is abstracted as 
multiple output streams by BrookGPU. By reading in 
4 i-particlcs into each kernel invocation and outputting 
the force on each into a separate output stream, we re- 
duce by a factor of four the size of each output stream 
compared with original. This reduces input bandwidth 



requirements to one quarter of original bandwidth be- 
cause each j-particle is only read by one-quarter as 
many fragments. To make this more clear, we show 
the pseudo-code for this kernel below. This kernel is 
designated as a 4x4 kernel. 

stream coordinates; 

stream index = range C 1 to N skip 4 ); 
stream forcesl, forces2, forces4, forces4; 

kernel kforce4x4( i ) 
forcel = 
force2 = 
forces = 
force4 = 
ril = coordinates [i] 
ri2 = coordinates [i+1] 
ri3 = coordinates [i+2] 
ri4 = coordinates [i+3] 
for j = 1 to N skip 4 

rjl = coordinates [j] 

rj2 = coordinates + 

rj3 = coordinates [j+2] 

rj4 = coordinates [j+3] 

forcel += f orce_function4C ril, rjl, rj2, rj3, rj4 ) 
force2 += f orce_function4C ri2, rjl, rj2, rj3, rj4 ) 
force3 += f orce_function4C ri3, rjl, rj2, rj3, rj4 ) 
force4 += f orce_function4C ri4, rjl, rj2, rj3, rj4 ) 

end 

return forcel, force2, force3, force4 
end kernel 

forcesl, forces2, forces3, forces4 = kforce4x4C indices ) 

In the above code, the input is the sequence of in- 
tegers l,5,9,...iV and the output is 4 force streams, 
f orce jfunction4 uses the 4-way SIMD math available 
on the GPU to compute 4 forces at a time. The four 
output streams can be trivially merged into a single one 
if needed. Results for this kernel can be seen in Figure[T] 



Once more the kernel has become instruction-rate hm- 
ited and its bandwidth is half that of the maximum 
bandwidth of the ATI board, but the overall perfor- 
mance has increased significantly. 

3.4 Optimization for small systems 

In all cases, performance is severely limited when the 
number of particles is less than about 4000. This is due 
to a combination of fixed overhead in executing ker- 
nels and the lack of sufficiently many parallel threads 
of execution. It is sometimes necessary to process small 
systems or subsystems of particles {N w 100 — 1000). 

For example, in molecular dynamics where forces 
tend to be short-range in nature, it is more common 
to use 0{N) methods by neglecting or approximating 
the interactions beyond a certain cutoff distance. How- 
ever, when using continuum solvent models, the num- 
ber of particles is small enough {N « 1000) that the 
0{N^) method is comparable in complexity while giv- 
ing greater accuracy than 0{N) methods. 

It is common in stellar dynamics to parallelize the 
individual time step scheme by using the block time 
step method [McMillan 1986| . In this method forces 
are calculated on only a subset of the particles at any 
one time. In some simulations a small core can form 
such that the smallest subset might have less than 1000 
particles in it. To take maximal advantage of CPUs it is 
therefore important to get good performance for small 
output stream sizes. 

To do this, we can increase the number of parallel 
threads by decreasing the j-loop length. For example, 
the input stream can be replicated twice, with the j-loop 
looping over the first N/2 particles for the first half of 
the replicated stream and looping over the second N/2 
particles for the second half of the stream. Consider the 
following pseudocode that replicates the stream size by 
a factor of 2: 



stream coordinates; 

stream indices = range C 1 to 2M ) ; 

stream partial_f orces ; 

kernel kforceC i ) 
force = 
if i <= K: 

ri = coordinates Ci] 
for j = 1 to N/2 

rj = coordinates [j] 

force = force + f orce_f unctionC ri, rj ) 

end 

else 

ri = coordinates [i-N+1] 
for j = N/2+1 to N 

rj = coordinates [j] 

force = force + f orce_functionC ri, rj ) 

end 

endif 

return force 
end kernel 

partial_f orces = kforceC indices ) 




2 4 6 

Replication of i-particles 

Figure 2: Performance improvement for LJC(sigmoidal) 
kernel with i-particle replication for several values of N 

In this example, the stream indices is twice as long 
as the coordinates stream and contains integers in se- 
quence from 1 to 2N. After applying the kernel kf orce 
on indices to get partiaUorces, the force on particle 
i can be obtained with by adding partiaUorces [i] 
and partiaUorces [i+N] , which can be expressed as 
a trivial kernel. The performance of the LJC(sigmoidal) 
kernel for different number of replications of the i- 
particles is shown in Figure [2] for several system sizes. 



4 Results 

All kernels were run on an ATI X1900XTX PCIe graph- 
ics card on Dell Dimension 8400 with ATI Catalyst ver- 
sion 7.2 drivers and the latest DirectX SDK (December 
2006). A number of different force models were im- 
plemented with varying compute-to-bandwidth ratios 
(see Table [Ij. A sample code listing is provided in 
the appendix (|A.1[) to show the details of how flops are 
counted. 

To compare against the CPU, a specially optimized 
version of the GA and GAJ kernels were written since 
no software suitable for a direct comparison to the GPU 
existed. The work of [Nitadori et al. 2005] uses SSE for 
the GAJ kernel but does some parts of the calculation in 
double precision which makes it unsuitable for a direct 
comparison. The performance they achieved is com- 
parable to the performance achieved here. Using SSE 
intrinsics and Intel's C-f-l- Compiler v9.0, we were able 
to obtain sustained performance of 3.8 GFlops on a 3.0 
GHz Pentium 4. 

GROMACS ILindahl et al. 2001] is currently the 
fastest performing molecular dynamics software with 
hand-written SSE assembly loops. As mentioned in Sec- 
tion[5]thc CPU can do out-of-order writes without a sig- 
nificant penalty. GROMACS uses this fact to halve the 
number of calculations needed in each force calculation 
step. In the comparison against the GPU in Table[2]thc 
interactions per second as reported by GROMACS have 



been doubled to reflect this. In MD it is common to use 
neighborlists to reduce the order of the force computa- 
tion to 0{N). The performance of GROMACS doing 
an 0{N'^) calculation as well as an 0{N) calculation 
for a 80 residue protein (lambda repressor, 1280 atoms) 
is shown in TablcJ^l Despite using a fairly modest cutoff 
length of 1.2 nm for the 0{N) calculation, the 0{N'^) 
GPU calculation represents an order-of-magnitude per- 
formance improvement over existing methods on CPUs. 



5 Discussion 

5.1 Comparison to other Architectures 

In Figure [3] is a comparison of interactions/sec be- 
tween the ATI X1900XTX, GRAPE-6A and a Pentium 
4 3.0GHz. The numbers for the GPU and CPU are 
observed values, those for GRAPE-6A are for its theo- 
retical peak. Compared to GRAPE-6A, the GPU can 
calculate over twice as many interactions when only the 
acceleration is computed, and a little over half as many 
when both the acceleration and jerk arc computed. The 
GPU bests the CPU by 35x, 39x and 15x for the GA, 
LJC(constant) and GAJ kernels respectively. 

Another important metric is performance per unit 
of power dissipated. These results can be seen in 
Figure [5l Here the custom design and much smaller 
on-board memory allows GRAPE-6A to better the 
GPU by a factor of 4 for the GAJ kernel, although 
they arc still about equal for the GA kernel. The 
power dissipation of the Intel Pentium 4 3.0 GHz 
is 82W [Intel 2006] . the X1900XTX is measured to 
be 85W, we estimate GRAPE-6A's dissipation to be 
48W since each of the 4 processing chips on the board 
dissipates approximately 12W [Makino et al. 2003b| 
and MDGRAPE-3's (MD3-PCIX) dissipation is 
40W |Peta Computing Institute 2006| . 

The advantages of the GPU become readily ap- 
parent when the metric of performance per dol- 
lar is examined (Figure [4]). The current price of 
an Intel Pentium 4 630 3.0GHz is $100, an ATI 
X1900X TX is $350, and an MDGRAPE -3 board costs 
$16000 |Peta Computing Institute 2006| . The GPU 
outperforms GRAPE-6A by a factor of 22 for the GA 
kernel and 6 for the GAJ kernel. 



5.2 Hardware Constraints 

The 4x4 unrolling that is possible with the GA kernel 
does not work for the other, more complicated kernels. 
For example, the GAJ kernel requires two outputs per 
particle (jerk in addition to acceleration). This reduces 
the maximum unrolling possibility to 2x4 because the 
GPU is limited to a maximum of 4 outputs per kernel. 
However, even this amount of unrolling doesn't work 
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Figure 3: Speed comparison of CPU, GPU, GRAPE-6A 
and MDGRAPE-3 
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Figure 4: Useful MFlops per second per U.S. Dollar of 
CPU, GPU, GRAPE-6A and MDGRAPE-3 
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Figure 5: Millions of Interactions per Watt of CPU, 
GPU, GRAPE-6A and MDGRAPE-3 



Kernel 


GMX Million 
Interactions / sec 


GMX 0{N^) 
ns/day 


GMX 0(N) 
ns/day 


GPU Million 
Interactions/ sec 


GPU 

ns/day 


LJC(constant) 


66 


5.6 


18.2 


1327 


140 


LJC(linear)* 


33 


2.06 


9.08 


1327 


140 


LJC(sigmoidal) 


40 


2.5 


11 


1203 


127 



Table 2: Comparison of GRO MACS (GMX) running on a 3.2 GHz Pentium 4 vs. the GPU showing the simulation time 
per day for an 80 residue protein (lambda repressor) *GROMACS does not have an SSE inner loop for LJC(linear) 



because the compiler cannot fit the kernel within the 32 
available registers. The number of registers is also what 
prevents the LJC kernels from being unrolled by 4x4 
instead of 2x4. 

This apparent limitation due to the number of regis- 
ters appears to result from compiler inefhcieneies; the 
authors are currently hand coding a 2 x 4 G A J kernel di- 
rectly in pixel shader assembly which should cause the 
kernel to become compute bound and greatly increase 
its performance. The performance gain of unrolling the 
LJC kernels to 4x4 by rewriting them in assembly would 
most likely be small since these kernels are already com- 
pute bound. 

While the maximum texture size of 4096x4096 and 
512 MB would make it possible to store up to 16 million 
particles on the board at a time, this really isn't neces- 
sary. In fact, GRAPE-6A only has storage for 131,000 
particles on the board at any one time. This is small 
enough to occasionally seem restrictive - a good balance 
is around 1 million particles which could easily be ac- 
commodated by 64MB. If board manufacturers wanted 
to produce cheaper boards specifically for use in these 
kinds of computations they could significantly reduce 
the cost without affecting the functionality by reducing 
the amount of onboard RAM. 

The current limits on the number of instructions also 
impacts the efficiency of large GPGPU programs. On 
ATI hardware, the maximum shader length of 512 in- 
structions limits the amount of loop unrolling and the 
complexity of the force functions we can handle. 

5.3 On-board Memory vs. Cache Usage 



As mentioned in Section 13.31 we expect the kernels to 
make very efficient use of the cache on the boards. There 
are a maximum of 512 threads in flight on the ATI 
X1900XTX at any one time |ATI 2006] . and in the ideal 
situation, each of these threads will try and access the 
same j-particle at approximately the same time. The 
first thread to request a j-particle will miss the cache and 
cause the particle to be fetched from on-board memory, 
however once it is in the cache, all the threads should 
be able to read it without it having to be fetched from 
on-board memory again. 

For example, in the case of the GA kernel with 65,536 
particles, there would be 16,384 fragments to be pro- 
cessed, and if fragments were processed in perfectly sep- 
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Figure 6: GFlops achieved as a function of memory 
speed 



arate groups of 512, then 32 groups would need to be 
processed. Each group would need to bring in 65,536 
particles from main memory to the cache resulting in an 
extremely low memory bandwidth requirement of 38.2 
MB/sec. 

Of course, the reality is that particles are not pro- 
cessed in perfectly separate groups of 512 particles that 
all request the same particle at the same time, but by us- 
ing ATITool [ATITool 20"06] to adjust the memory clock 
of the board we can determine how much bandwidth 
each kernel actually needs to main memory. The re- 
sults of this testing can be seen in Figure [S] 

The performance degradation occurs at approxi- 
mately 11.3, 5.2, and 2.1 GB/sec for the LJC, GAJ 
and GA kernels respectively. The LJC kernels must also 
read in an exclusion list for each particle which does not 
cache as well as the other reads, and is the reason why 
their bandwidth to main memory is higher than that 
of the gravity kernels. The number for the GA kernel 
suggests that approximately 10 particles are accessing 
the same j-particle at once. 

At memory speeds above 500MHz all the kernels run 
very near their peak speed, thus board manufactur- 
ers could not only use less RAM, they could also use 
cheaper RAM if they were to produce a number of 
boards that would only be used for these calculations. 
This would reduce the cost and power requirements over 
the standard high end versions used for gaming. 



5.4 Distributed Computation 



A Appendix 



Most biological phenomena of interest occur on 
timescales currently beyond the reach of MD simula- 
tions. For example, the simplest proteins fold on a 
timescale of 5 to 20 microseconds, while more complex 
proteins may take milliseconds to seconds. MD simu- 
lations on current generation CPUs are usually limited 
to simulating about 10 nanoseconds per day - it would 
take several years to obtain a 10 microsecond simula- 
tion. However, with the speed increases afforded by the 
algorithms and hardware discussed here, we are now be 
able to simulate protein dynamics with individual tra- 
jectories on the 10 microsecond timescale in under three 
months. This will allow the direct simulation of the fold- 
ing of fast-folding proteins. Moreover, by incorporating 
this methodology into a distributed computing frame- 
work, we are now situated to build Markovian State 
Models to simulate even longer timescales, likely ap- 



proaching seconds Jaychandran et al. 2006 . Thus with 



the combined effort of CPUs and distributed comput- 
ing, one would be able to reach timescales for folding of 
essentially all single-domain, two-state folding proteins. 
Compared to the donations of CPUs from over 150,000 
Windows computers currently producing 145 TFlops, 
we have 550 CPUs donated to the project producing 
over 34 TFlops. Thus each GPU is providing roughly 
60 times the performance of the average donated x86 
CPU. 



6 Conclusion 

We have successfully taken advantage of the processing 
power available on CPUs to accelerate pairwise force 
calculations for several commonly used force models in 
stellar and molecular dynamics simulations. In some 
cases the GPU is more than 25 times as fast as a highly 
optimized SSE-based CPU implementation and exceeds 
the performance of custom processors specifically de- 
signed for these tasks such as GRAPE-6A. Further- 
more, our performance is compute bound, so we are 
well poised to take advantage of further increases in the 
number of ALUs on CPUs, even if memory subsystem 
speeds do not increase significantly. Because CPUs are 
mass produced, they are relatively inexpensive and their 
performance to cost ratio is an order of magnitude bet- 
ter than the alternatives. The wide availability of CPUs 
will allow distributed computing initiatives like Fold- 
ing@Home to utilize the combined processing power of 
tens of thousands of CPUs to address problems in struc- 
tural biology that were hitherto computationally infea- 
sible. We believe that the future will see some truly 
exciting applications of CPUs to scientific computing. 



A.l Flops Accounting 

To detail how we count flops we present a snippet of 
the actual Brook code for the GA kernel. The calcula- 
tion of the acceleration on the first i-particle has been 
commented with our flop counts for each instruction. 
In total, the calculation of the acceleration on the first 
i-particle performs 76 flops. Since four interactions are 
computed, this amounts to 19 flops per interaction. 



floats dl, d2, d3, d4, outaccell; 

float4 jmass, r, rinv, rinvcubed, scalar; 

dl = jposl - iposl; //3 

d2 = jpos2 - iposl; //3 

dS = jposS - iposl; //3 

d4 = jpos4 - iposl; //3 

r.x = dct( dl, dl ) + eps; //6 

r.y = dot( d2, d2 ) + eps; //6 

r.z = dot( d3, d3 ) + eps; //6 

r.w = dct( d4, d4 ) + eps; //6 

rinv = rsqrt( r ); //4 

rinvcubed = rinv*rinv*rinv; //8 

scalar = jmass * rinvcubed; //4 

outaccell += scalar. y * d2 + scalar. z * d3 + scalar. w * d4; //18 



if ( Ilist.x 
outaccell 

} 



Jlistl . 
scalar . 



//don 
//6 



t add force due to ourself 
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